Mutations within lncRNAs are effectively selected
against in fruitfly but not in human

Wilfried Haerty and Chris P Ponting
Abstract 

Background: Previous studies in Drosophila and mammals have revealed levels of long non-coding RNA
(lncRNA) sequence conservation that are intermediate between neutrally evolving and protein-coding sequence.
These analyses compared conservation between species that diverged up to 75 million years ago. However,
analysis of sequence polymorphisms within a species' population can provide an understanding of essentially
contemporaneous selective constraints that are acting on lncRNAs and can quantify the deleterious effect of
mutations occurring within these loci.

Results: We took advantage of polymorphisms derived from the genome sequences of 163 Drosophila
melanogaster strains and 174 human individuals to calculate the distribution of fitness effects of single nucleotide
polymorphisms (SNPs) occurring within intergenic lncRNAs and compared this to distributions for SNPs present within
putatively neutral or protein-coding sequences. Our observations show that in D. melanogaster there is a significant
excess of rare frequency variants within intergenic lncRNAs relative to neutrally evolving sequences, whereas
selection on human intergenic lncRNAs appears to be effectively neutral. Approximately 30% of mutations within
these fruitfly lncRNAs are estimated as being weakly deleterious.

Conclusions: These contrasting results can be attributed to the large difference in effective population sizes
between the two species. Our results suggest that while the sequences of lncRNAs will be well conserved across
insect species, such loci in mammals will accumulate greater proportions of deleterious changes through genetic
drift.
Background

Although protein coding sequence occupies a little over
1% of the human genome, approximately 10-fold more
non-coding sequence is predicted to have been under
purifying selection [1]. For smaller genomes, larger
proportions (for example, 50% of all Drosophila sequence)
have been predicted to have been under selective
constraints [13]. These estimates are founded on the
assumption that sequence conservation is caused not by
low rates of mutation, but instead by the high rates at
which deleterious alleles are purged from the population
by natural selection, an assumption that is well
supported [47].

A considerable fraction of conserved non-coding
sequences in human and fruitfly genomes are
transcribed [8,9]. Non-coding transcripts can be classified
into small RNAs (<200 nt, such as microRNA) and long
RNAs (>200 nt, lncRNA). Many lncRNAs are spliced
and/or polyadenylated [10], and they show tendencies to
contain a smaller number of exons than protein coding
genes and to be expressed in a tissue and/or
developmental stage-specific manner [11]-[13].

A handful of lncRNAs have been functionally
characterised as being involved in dosage compensation in
either human (Xist [14]) or Drosophila (roX1, roX2
[15]), or having roles in imprinting or chromatin
modification (AIRN [16]; HOTAIR [17]), in alternative splicing
regulation or in cell differentiation (MALAT1, Tug1
[18]-[20]). More broadly, many lncRNAs appear to be
involved in gene expression regulation in either cis or
trans, through the local modification of chromatin and/
or direct interaction with protein complexes, DNA or
RNA sequences [11,12,21]-[23]. Recently lncRNAs have
also been associated with the maintenance of embryonic
stem cell pluripotency [24,25]. Furthermore, there is
limited evidence to link some lncRNAs, such as ANRIL
or HOTAIR, to human pathologies [26,27]. However, the
functional contribution to biology from the vast majority
of lncRNAs remains unknown.

If a lncRNA has retained functionality over a long
evolutionary time-period then mutations that abolish or
diminish the function would be deleterious and would
preferentially be purged from the species lineage. This
would be reflected in a greater level of sequence
conservation between species. Indeed, lncRNAs have been
found to be significantly better conserved between
species than are putatively neutrally evolving sequences,
such as ancestral repeats in mammals [28]-[30] or small
introns in Drosophila [13]. Furthermore, mammalian
lncRNAs are enriched in conserved sequences identified
either by elevated conservation scores (for example,
phastCons [2]) or by applying a neutral model based on
sequence insertions and deletions [28,30]. Additionally,
increased conservation of the dinucleotide splice sites
and a suppressed transversion rate have also been
reported for mammals [28]. However, in each organism
analysed thus far, lncRNA sequences have been shown
to diverge far more rapidly than have protein-coding
sequences [13,28]-[31]. These observations indicate an
intermediate state in selective constraints between
protein-coding sequences and neutrally evolving sequences.
The rapid divergence of lncRNA sequences between
species complicates the identification of orthologous
sequences for many of the lncRNA loci. Therefore,
instead of nucleotide conservation, the conservation of
orientation and position relative to an orthologous
protein-coding gene can be used to define positionally
equivalent lncRNAs between species [13,32].

To date, most evolutionary analyses on lncRNAs have
been conducted at the interspecies level using species
that diverged approximately 75 million (human - mouse
[28]) or 5 million years (Drosophila melanogaster - D.
simulans [13]) ago. Although there is mounting
evidence for purifying selection acting on lncRNAs, we
note that previous analyses have used only a single
reference genome per species. Previous studies reported
an increased conservation level relative to a neutral
reference [13,28]-[30], but they have not directly
determined the strength of selection acting on these
non-coding sequences, nor do they provide an understanding
of the fitness effects of mutations occurring within these
transcripts, in terms of the product of the effective
population size (Ne) and selection coefficient (s).

It is important to compare interspecific indicators of
constraint to intraspecific estimates of fitness effects
since recent findings have demonstrated rapid evolution
of lncRNAs that are specific to individual lineages [33].
A comparison between species can inform on past
events but rarely does it have the power to identify
contemporaneous or lineage-specific selective
constraints. Even when employing comparisons among
multiple species it is challenging to ascertain, within a
specific lineage, the nature and the strength of the
selective pressures acting on rapidly evolving loci.

For instance, the HOTAIR locus has evolved rapidly
since the last common ancestor of mouse and human
and differences in the consequences of knockout in
these species' cell lines have been interpreted as
indicating the evolution of lineage-specific biological functions
[34]. Additionally, it was recently demonstrated that
expression of a large number of lncRNA loci has altered
rapidly among murid lineages [33]. Consequently, a low
level of sequence conservation between two species
could reflect, at one extreme, a historically low level of
sequence constraint in both lineages, or, at the other
extreme, it could reflect sequence that is constrained in
only a portion of a single species lineage. Deciding
among this range of possibilities relies on determining
constraint within extant populations, for example by
identifying whether derived low frequency alleles are
enriched, relative to neutral sequence, within human or
Drosophila lncRNA sequence [35]. A recent study
indicated that this was, indeed, the case for human lncRNAs
identified by the ENCODE consortium [36].

In such studies we need to consider that most human
variants are recent [7,37], and there is a negative
correlation between the age of the variant and its deleterious
effect [7]. Consequently, the bulk of deleterious
mutations within a species are less likely to be detected when
comparing distantly related species as they will not
often reach fixation. Therefore, inter-species comparisons
will focus on substitution events that are at most
weakly deleterious, as deleterious mutations are rarely
fixed. Once again this underscores the importance of
analysing, at the population level, nucleotide variation
occurring within lncRNA loci if we are to better
understand the relationships linking their evolution and
function. A potentially important confounding issue that
needs to be considered in such analyses is that of
background selection as well as selective sweeps, where
selection at one site reduces genetic diversity, but not
divergence, at linked sites [38]. To account for this
effect, variation at tested sites needs to be compared
against variation in physically linked putatively neutral
sites.

For this study, we have taken advantage of recent
high-throughput sequencing projects in D. melanogaster
[39] and humans [37,40], and the annotation of
intergenic lncRNAs in both species [13,41]. The
availability of these large population datasets permits
polymorphism and divergence distributions to be
investigated in both species across both coding and
non-coding gene models. If the function of a lncRNA
locus is mediated through the act of transcription rather
than through the RNA transcript itself [42,43] then we
expect no difference in nucleotide conservation between
exons and introns. In contrast, if the spliced transcript
primarily has an RNA sequence-dependent function then
its exonic sequence is expected to be well-conserved
relative to its introns, as has been observed for
protein-coding genes [44].

Our results reveal hitherto unappreciated distinctions
in constraint between lncRNA exons and introns which
are abundantly evident for Drosophila but are far less so
for humans. In Drosophila, striking differences in
conservation between exons and introns suggest that the
spliced transcript is often important in mediating the
biological functions of lncRNA loci. Our analysis of site
frequency spectra indicates that purifying selection has
been effective on D. melanogaster lncRNA sequence but,
importantly, not on human lncRNAs. Selection on
mutations within human lncRNAs appears to be
effectively neutral as a consequence of our species' unusually
low effective population size.
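
The dependence of selection efficacy on effective population size can be illustrated with a minimal sketch. It uses Kimura's diffusion approximation for the fixation probability of a new additive mutation, which is a standard population genetics result and not a method taken from this study. The Ne values and the selection coefficient below are order-of-magnitude assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math

def fixation_probability(ne: float, s: float) -> float:
    """Fixation probability of a single new additive mutation.

    Kimura's diffusion approximation, assuming the census size equals
    the effective population size ne (an illustrative simplification).
    """
    if s == 0:
        return 1.0 / (2.0 * ne)            # neutral expectation
    exponent = -4.0 * ne * s
    if exponent > 700.0:                   # avoid overflow; probability is ~0
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)

def relative_to_neutral(ne: float, s: float) -> float:
    """Fixation probability divided by the neutral value 1/(2*ne)."""
    return fixation_probability(ne, s) * 2.0 * ne

# Assumed, order-of-magnitude effective population sizes (not from the paper).
ILLUSTRATIVE_NE = {"human": 1e4, "D. melanogaster": 1e6}
S = -1e-5  # assumed selection coefficient of a weakly deleterious mutation

for species, ne in ILLUSTRATIVE_NE.items():
    print(species, "Ne*s =", ne * S,
          "fixation relative to neutral =", relative_to_neutral(ne, S))
```

Under these assumptions the same mutation has |Nes| below 1 in the smaller population, where it behaves almost like a neutral variant, and |Nes| well above 1 in the larger population, where it is very unlikely to fix.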

Results 

Conservation of intergenic lncRNA exons in Drosophila

Our previous evolutionary rate analyses of Drosophila
[13] or mammalian [28,30,45] intergenic lncRNAs
considered the degree of constraint associated with
transcribed lncRNA sequence under the assumption that
small introns and preserved transposable element
sequences ('ancestral repeats') evolve neutrally [3,46-48].

We extended these analyses firstly by addressing the
issue of whether, as for protein-coding sequence [44],
exonic sequence is better conserved than intronic sequence.
To do this we performed a metagene analysis by recording
the median phastCons scores of decile portions for the
first, middle or last exons, or their intervening introns, of
1,115 fruitfly and 4,662 human lncRNAs (Figure 1).
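
The decile-based summary described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each exon or intron is available as a list of per-base phastCons scores, and it assumes that the per-decile value plotted is the median across loci of each locus's decile median. The function names are hypothetical.

```python
from statistics import median

def decile_medians(scores):
    """Split per-base scores into 10 non-overlapping windows
    (each about 10% of the sequence) and return each window's median."""
    n = len(scores)
    if n < 10:
        raise ValueError("need at least 10 positions to form deciles")
    # Integer boundaries guarantee non-empty, non-overlapping windows.
    bounds = [i * n // 10 for i in range(11)]
    return [median(scores[bounds[i]:bounds[i + 1]]) for i in range(10)]

def metagene_profile(per_locus_scores):
    """Median across loci of the decile medians (assumed aggregation)."""
    per_locus = [decile_medians(s) for s in per_locus_scores]
    return [median(column) for column in zip(*per_locus)]

# Toy input: three hypothetical exons of different lengths.
toy = [[0.1] * 40, [0.9] * 25, [0.5] * 10]
print(metagene_profile(toy))
```

Normalising every feature to ten windows is what allows exons and introns of very different lengths to be overlaid on a single profile.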

For Drosophila lncRNAs, we observed a strong
contrast in median phastCons scores between their exons
and their introns (Figure 1). While protein-coding exons
exhibit the greatest degree of conservation, as expected,
lncRNA exons are associated with intermediate
conservation levels, greater than those for protein-coding or
lncRNA introns or indeed randomly sampled intergenic
sequence (P <0.001, Figure 1A). Strong purifying
selection in exonic, but not intronic, sequence implies that
the molecular functions of these multi-exonic fruitfly
lncRNAs are predominantly RNA-sequence specific
rather than requiring only the process of transcription,
for example during chromatin remodelling [11,42,43].

Performing the identical analysis on a set of human
lncRNAs [41] revealed their median phastCons scores to
be low not just for introns but also for exons (Figure
1B). There is a significantly greater conservation for
lncRNA exons compared with introns (P <0.05) except
for the 3' last-most exon whose conservation is not
significantly different to that of introns (P >0.05 in all
comparisons, Additional File 1). Moreover, sequence
conservation in human lncRNA exons or introns is little
different from conservation of intergenic sequence. We
found similar results when using different human
lncRNA sets as well as a set of positionally equivalent
lncRNAs between human and mouse (Additional File 2).

Interestingly, when, instead of median values, mean
phastCons scores for human lncRNA exons are
considered, these are marginally higher than intronic scores
(Additional File 1). We conclude from these
observations that there is substantial heterogeneity in
conservation among human lncRNA loci, yet sequence for the
majority of such loci shows little or no conservation.

We noted that D. melanogaster lncRNAs exhibit no
elevation of phastCons scores at their 5' or 3' splice
sites using either the median or mean conservation
scores (Figure 1A, Additional File 1). To investigate this
further we compared the conservation of splice site
dinucleotides ('GT' and 'AG') across five species with
randomly selected 'GT' and 'AG' dinucleotides yet found
no significant difference in their levels of conservation
(Additional File 3). One conceivable explanation is that
across the approximately 300 million years of evolution
represented in the Diptera and Coleoptera phastCons
scores, splice site dinucleotides have been conserved less
than over the approximately 450 million years represented
in the vertebrate phastCons scores.

Lowered polymorphism levels within intergenic lncRNA
exons relative to introns

The conservation analysis that we present above
illustrates qualitatively the relative conservation between
exons or introns, and differences in constraint between
fruitfly and mammalian lncRNA sequences. This
analysis is based on aligned sequences from highly divergent
species and therefore provides us with evidence on past
selection but unfortunately not on more contemporary
evolutionary processes. To address this, we looked to
DNA polymorphism data from both D. melanogaster
and human populations.

Figure 1 Median sequence conservation (phastCons) score across protein coding (blue) and lncRNA (red) exons and introns in
D. melanogaster (A) and in human (B). Non-overlapping windows each comprising 10% of the sequences were used. The shaded areas represent
the 95% confidence intervals over the median. The grey lines represent the median scores computed using 1,000 resamplings of intergenic
sequences matching the lncRNA size distribution.

Table 1 Number of polarised polymorphic sites among 162 D. melanogaster strains and among 174 humans of African origin.
The ancestral and derived states for each SNP were defined using alignments of D. melanogaster with D. simulans and
D. yakuba and of H. sapiens with P. troglodytes and M. mulatta.

Feature              D. melanogaster    H. sapiens
Total                2,263,316          12,640,342
lncRNA exons         29,535             49,505
Ancestral repeats    -                  317,098
Others               921,066            8,039,366

We considered 2,263,316 polymorphic sites in D.
melanogaster and 12,640,342 in human, and used pairwise
alignments with D. simulans and D. yakuba, or with P.
troglodytes and M. mulatta, respectively, to polarise
SNPs for D. melanogaster or human according to
whether they were ancestral or derived using maximum
parsimony (Table 1). For all subsequent analyses, we
compared observed levels of polymorphism and
divergence within lncRNA loci to polymorphism and
divergence observed within putatively neutrally evolving
sequences such as small introns (<86 nt) in Drosophila
[3,46,48] and ancestral repeats in human [47].
Importantly, in order to take into account potential variation
in local rates of mutation and/or substitution as well as
nucleotide content in human or Drosophila, we limited
our analyses to just those protein-coding genes that
flank intergenic lncRNAs. Additionally, we considered
only small introns present within protein coding genes
that are direct neighbours and within 5 kb of lncRNA
loci in D. melanogaster and only ancestral repeats found
within intergenic sequences that are direct neighbours
of mammalian lncRNA loci. We retained only those
lncRNA loci for which matching small introns or
ancestral repeats could be identified.
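
The polarisation step can be sketched in a few lines. The rule used here is an assumption, since the text states only that maximum parsimony with two outgroups was applied: a SNP is polarised when both outgroup bases agree and match one of the two segregating alleles, and is otherwise left unresolved. The function name and argument layout are hypothetical.

```python
VALID_BASES = set("ACGT")

def polarise_snp(allele_a, allele_b, outgroup_near, outgroup_far):
    """Return (ancestral, derived) for a biallelic SNP, or None.

    Assumed parsimony rule: both outgroups must carry the same base and
    that base must be one of the two alleles segregating in the focal
    species (for example D. simulans and D. yakuba for D. melanogaster).
    """
    alleles = {allele_a.upper(), allele_b.upper()}
    near, far = outgroup_near.upper(), outgroup_far.upper()
    if len(alleles) != 2 or not alleles <= VALID_BASES:
        return None                      # not a clean biallelic SNP
    if near != far or near not in alleles:
        return None                      # outgroups conflict, gap, or third base
    ancestral = near
    (derived,) = alleles - {ancestral}
    return ancestral, derived

print(polarise_snp("A", "G", "G", "G"))   # both outgroups support G
print(polarise_snp("A", "G", "A", "G"))   # conflicting outgroups
```

Discarding sites where the outgroups disagree reduces the number of usable SNPs but limits the misassignment of ancestral states, which would otherwise inflate the apparent number of high-frequency derived alleles.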

For both human and Drosophila, we observed a lower
density of polymorphic sites within protein-coding exons
than in introns (P <0.001 in both species), which indicates
strong negative selection having acted on these exons.
Although similar trends were observed for lncRNAs,
differences in SNP densities for lncRNA exons and introns were
not significant (P >0.05 in both species, Tables 2 and 3).

The ratio of D. melanogaster polymorphism to D.
melanogaster-D. simulans divergence within lncRNA exons
or introns was compared to that of small introns or
randomly sampled flanking intergenic sites. The significant
excess of polymorphism with respect to divergence
within lncRNA exons (χ² test, P <0.001), but not
introns (χ² test, P >0.05, Figure 2), illustrates the
strength of purifying selection acting on fruitfly
lncRNAs, and specifically their exons.
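
A comparison of this kind reduces to a 2 x 2 contingency table of polymorphic and divergent sites in the tested class versus the neutral proxy. The sketch below computes a Pearson χ² statistic with one degree of freedom without continuity correction; whether the original analysis applied a correction is not stated, so this is an assumption. The counts in the example are invented purely for illustration.

```python
import math

def mk_chi_square(poly_test, div_test, poly_neutral, div_neutral):
    """Pearson chi-square (1 d.f., no continuity correction) on a
    McDonald-Kreitman style 2 x 2 table. Returns (chi2, p_value)."""
    a, b, c, d = poly_test, div_test, poly_neutral, div_neutral
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table sums to zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def polymorphism_to_divergence(poly, div):
    """Ratio compared between tested and neutral classes."""
    return poly / div

# Hypothetical counts: tested class shows excess polymorphism.
chi2, p = mk_chi_square(300, 100, 200, 200)
print(polymorphism_to_divergence(300, 100), polymorphism_to_divergence(200, 200))
print(chi2, p)
```

An excess of polymorphism relative to divergence in the tested class is the pattern expected when weakly deleterious variants segregate at low frequency but rarely reach fixation.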

Evidence for strong purifying selection on intergenic
lncRNAs in Drosophila

Next, to test for the strength of selection within exons
or introns from fruitfly or human lncRNA loci, we
compared the nucleotide variation within lncRNAs and
protein coding exons and introns to putatively neutral
sequences using the average number of pairwise
nucleotide differences per site (πT), Watterson's θW [49,50],
and Tajima's D [51], which tests for departures from
neutrality. We also assessed the nucleotide divergence
between D. melanogaster and D. simulans, and between
human and macaque, using the
Jukes-Cantor corrected divergence (k [52]).
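
The summary statistics named above follow standard textbook definitions, sketched here under simplifying assumptions: every site is biallelic, all n chromosomes are called at every site, and the input is the count of one allele at each segregating site. The function names are hypothetical and the code is not the software used in the study.

```python
import math

def pairwise_diversity(allele_counts, n):
    """Sum over sites of the expected pairwise differences (pi, unscaled)."""
    return sum(2.0 * c * (n - c) / (n * (n - 1)) for c in allele_counts)

def watterson_theta(num_segregating, n):
    """Watterson's estimator (unscaled): S / a1."""
    a1 = sum(1.0 / i for i in range(1, n))
    return num_segregating / a1

def tajimas_d(n, num_segregating, pi_total):
    """Tajima's D from sample size, segregating sites and unscaled pi."""
    s = num_segregating
    if s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return float("nan")
    return (pi_total - s / a1) / math.sqrt(variance)

def jukes_cantor(p):
    """Jukes-Cantor corrected divergence from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy example: 10 chromosomes, 5 singleton SNPs in a 100 bp region.
n, length, counts = 10, 100, [1, 1, 1, 1, 1]
pi = pairwise_diversity(counts, n)
print(pi / length, watterson_theta(len(counts), n) / length,
      tajimas_d(n, len(counts), pi))
```

Because rare variants contribute little to π but count fully towards θW, an excess of low frequency alleles drives Tajima's D negative, which is the direction reported for fruitfly lncRNA exons relative to small introns.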

As expected, we inferred stronger selective constraints
on the protein-coding exons and introns of fruitfly genes,
owing to their lower Tajima's D and divergence (k), than
for small introns, our neutral evolution proxy (Kruskal-
Wallis test, P <0.05 in all comparisons, Table 2).
Likewise, D. melanogaster lncRNA exons and introns were
associated with lower Tajima's D and k values relative to
our neutral sequence proxy, namely small introns
(P <0.001 in both comparisons). Greater selective
constraint on Drosophila lncRNA exonic sequence was
observed: values for lncRNA exons were significantly lower
than for lncRNA introns (P <0.01 in both comparisons).
Although we found no difference in πT, θW or Tajima's D
values between lncRNAs and protein coding upstream
sequences (P >0.05 in all comparisons), we found lncRNA
upstream sequences to be less diverged than those of
protein coding sequences (P <0.001). This observation of
lower interspecific divergence is likely to be the
consequence of lncRNA gene models being incomplete, which
in turn is a consequence of their low expression levels.

Like fruitflies, human protein coding exons are under
stronger selective constraints than either lncRNA exons,
introns or protein-coding introns as indicated by lower
πT, Tajima's D and k values (P <0.001, Table 3). In
contrast to Drosophila, we found no significant
difference in Tajima's D values computed for human lncRNA
exons, introns and their flanking ancestral repeats.
Additionally, intergenic lncRNAs that are positional
equivalents between human and mouse do not show a
significant reduction of polymorphism or Tajima's D
value relative to a control set of intergenic lncRNAs (P
>0.05, Table 3, Additional Files 5 and 6).

Table 2 Average (standard deviation) polymorphism estimates for lncRNA loci and their flanking protein coding
genes (within 5 kb) in D. melanogaster.

Feature            πT                          θW                          Tajima's D      k
Upstream coding    4.8 x 10"^ (4.4 x 10"^)     5.39 x 10"^ (3.8 x 10"^)    -0.36 (0.97)    0.095 (0.74)
lncRNA exons       4.94 x 10"^ (3.2 x 10"^)    5.88 x 10"^ (3.2 x 10"^)    -0.53 (0.81)    0.064 (0.072)
Small introns      1.01 x 10"^ (1.12 x 10"^)   8.96 x 10"^ (8.16 x 10"^)   0.15 (1.17)     0.115 (0.10)

Table 3 Average (standard deviation) polymorphism estimates for lncRNA and their flanking protein coding genes in
human.

Feature                    πT                           θW                           Tajima's D      k
Upstream coding            1.05 x 10"^ (1 x 10"^)       1.03 x 10"^ (0.07 x 10"^)    0.003 (0.91)    1.51 x"^ (1.74 x 10"^)
lncRNA exons               1.06 x 10"^ (8.85 x 10"^)    1.16 x 10"^ (6.91 x 10"^)    -0.21 (0.99)    1.59 x"^ (1.58 x 10"^)
Upstream lncRNA            1.09 x 10"^ (1.07 x 10"^)    1.19 x 10"^ (8.26 x 10"^)    -0.14 (0.92)    1.64 x"^ (1.79 x 10"^)
PE lncRNA exons            9.73 x 10"^ (7.87 x 10"^)    1.13 x 10"^ (6.61 x 10"^)    -0.27 (0.88)    1.46 x"^ (1.41 x 10"^)
PE lncRNA introns          1.04 x 10"^ (7.62 x 10"^)    1.08 x 10"^ (4.67 x 10"^)    -0.20 (0.77)    1.42 x"^ (9.2 x 10"^)
Controls lncRNA exons      1.04 x 10"^ (8.62 x 10"^)    1.15 x 10"^ (6.57 x 10"^)    -0.22 (0.85)    1.46 x"^ (1.54 x 10"^)
Controls lncRNA introns    9.84 x 10"^ (6.48 x 10"^)    1.08 x 10"^ (5.09 x 10"^)    -0.26 (0.75)    1.47 x"^ (1.33 x 10"^)
Ancestral repeats          1.51 x 10"^ (1.81 x 10"^)    1.68 x 10"^ (1.14 x 10"^)    -0.13 (0.92)    2.34 x"^ (3.48 x 10"^)

PE: position equivalent.

Excess of low frequency variants in Drosophila intergenic
lncRNAs relative to neutral sequences

We next compared the derived allele frequency spectra
of polymorphic sites within fruitfly lncRNA exons to
those within small introns. This revealed that lncRNA
exons have a significantly higher proportion of SNPs
with low frequency (<0.01) derived alleles (Kolmogorov-
Smirnov test, P <0.001). This indicates that they have
been subject to a greater degree of purifying selection in
these fruitflies' recent evolution, since their divergence
from D. simulans (Figure 3). This effect was not solely
due to a G + C enrichment of conserved non-coding
regions relative to non-conserved non-coding regions
[53] since significant enrichment for low frequency
derived alleles was observed for both G:C→A:T and
A:T→G:C substitutions in lncRNA exons (Kolmogorov-
Smirnov tests, P <0.001 in both comparisons) relative to
small introns. The strength of purifying selection for
fruitfly lncRNA exons appears to be lower than for
non-synonymous or 3' UTR SNPs in protein-coding
transcripts but stronger than for SNPs in their 5' UTRs or
four-fold degenerate sites (Additional File 7). We
observed that sequences upstream of the lncRNA loci in
D. melanogaster are also enriched in low frequency
variants relative to small introns or to upstream sequences
of protein-coding genes (Additional File 8). This could
reflect purifying selection acting on these elements and/
or the presence of unannotated upstream lncRNA
exons.
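
The comparison of derived allele frequency spectra can be sketched as below. The code computes the proportion of SNPs under a frequency threshold and the two-sample Kolmogorov-Smirnov D statistic (the largest gap between the two empirical cumulative distributions). It deliberately stops short of a P value, which would normally come from a statistics package. The frequencies are invented toy data and the function names are hypothetical.

```python
import bisect

def proportion_rare(frequencies, threshold=0.01):
    """Fraction of SNPs whose derived allele frequency is below threshold."""
    return sum(f < threshold for f in frequencies) / len(frequencies)

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: maximum absolute difference
    between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Toy derived allele frequencies (illustrative only).
tested_class = [0.006, 0.006, 0.006, 0.012, 0.05, 0.30]
neutral_proxy = [0.006, 0.02, 0.10, 0.25, 0.40, 0.80]
print(proportion_rare(tested_class), proportion_rare(neutral_proxy))
print(ks_statistic(tested_class, neutral_proxy))
```

With a sample of roughly 160 chromosomes the lowest observable frequency is about 0.006, so a threshold of 0.01 effectively counts singletons.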

An equivalent analysis on the set of human lncRNAs,
using data from the 1000 Genomes Project [40],
revealed no enrichment of rare variants within human
lncRNA exons relative to candidate neutrally evolving
sequences such as four-fold degenerate sites, introns or
ancestral repeats (P >0.05, Figure 3). This result is
important in allowing us to extend from our previous
observation of a low degree of conservation between
species, to effectively neutral or weak negative selection
occurring since the emergence of modern humans. We
similarly found that the derived allele frequency (DAF)
of SNPs within positionally conserved lncRNAs does not
depart significantly from the distribution observed for
neighbouring ancestral repeats. While we observe a
departure in the human lncRNA SNP DAF with respect
to that for ancestral repeats sampled genome-wide, this
is likely attributable to the effects of background
selection: negative selection acting on the genomically
proximal protein-coding genes.

Deleterious effect of mutations within intergenic lncRNAs in
fruitfly but not in human

In our final analysis we estimated the distribution of
fitness effects of new mutations within D. melanogaster or
human lncRNA exons from their respective site
frequency spectra. Because the DAF spectra can be
influenced by past variation in effective population size, we
employed the method of Keightley and Eyre-Walker
[54] that estimates the distribution of fitness effects of
new mutations and demographic parameters from the
folded frequency spectrum.
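
A folded spectrum counts sites by the frequency of the minor allele, so it does not depend on knowing which allele is ancestral. The sketch below shows only this folding step, under the assumption that every site has been called in all n chromosomes; it is not an implementation of the inference method of reference [54], and the function name is hypothetical.

```python
def folded_sfs(allele_counts, n):
    """Folded site frequency spectrum.

    allele_counts: count of either allele at each site among n chromosomes.
    Returns a list where entry k is the number of sites whose minor
    allele is observed k times (k = 0 .. n // 2).
    """
    spectrum = [0] * (n // 2 + 1)
    for count in allele_counts:
        if not 0 <= count <= n:
            raise ValueError("allele count outside 0..n")
        spectrum[min(count, n - count)] += 1
    return spectrum

# Toy example with 10 chromosomes: counts 1 and 9 fold into the same class.
print(folded_sfs([1, 9, 5, 2, 8], 10))
```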

As our proxy for neutrally evolving sequence we
considered site frequency spectra from sites randomly
sampled within flanking intergenic sequences. Likewise,
we used four-fold degenerate sites as neutral proxy
when calculating the distribution of fitness effects of new
mutations at 0-fold degenerate sites. In fruitflies,
two-thirds of mutations in lncRNA exons are predicted to be
effectively neutral (Nes <1; 64.18%, 95% CI 63.8% to
64.5%) while one-third are likely to be deleterious (Nes
>1; 35.82%, 95% CI 35.0% to 36.6%). In stark contrast,
no mutations in human lncRNAs were classified in this
analysis as being deleterious, including those lncRNAs
with positional equivalents in mouse. Consequently, we
predict that the great majority of substitutions in human
lncRNA sequence are effectively selectively neutral or
nearly neutral (Figure 4). As an additional comparison
we also computed the distribution of fitness effects for
non-degenerate sites within protein-coding genes
associated with lethal mutant phenotypes in D. melanogaster
or associated with genetic diseases or syndromes in
human. As expected for these two sets of sites we
observed an increased proportion of sites classified as
being highly deleterious (Nes >100) relative to
non-degenerate sites from all remaining protein-coding
genes. Once again the proportion is strikingly higher for
the D. melanogaster set (70.92%) than it is for the
human set (59.74%) of deleterious amino acid changes.

Figure 2 McDonald-Kreitman test for lncRNA exons and
introns. Small intronic sequences (<86 nt, disregarding the first 6 nt
and last 16 nt) were used as a proxy for neutrally evolving
sequences. P <0.001, ns: not significant.

Our estimates of the distribution of fitness effects of
newly arising mutations within non-degenerate sites are
in agreement with previous analyses conducted in
human. Boyko et al. [55] as well as Keightley and Eyre-
Walker [54] identified between 22% and 34% of newly
arising mutations within the African population as being
selectively effectively neutral (our estimate: 26.69%).

Discussion 

Previous between-species comparisons predict lncRNAs
to have evolved under a regime of purifying selection
that is considerably weaker than for protein-coding
sequences [13,28]-[31]. Because of their design,
virtually all of these experiments consider evolutionarily
ancient selective events. However, by taking advantage
of available sequenced genomes of individuals from
within the same species, we can now: (1) infer the
evolution of these sequences at a considerably shorter time
scale; (2) quantify more precisely the strength of recent
or contemporaneous selection acting on lncRNAs; and
(3) assess the distribution of fitness effects of new
deleterious mutations occurring within these sequences.
From the reported importance of a limited subset of
lncRNAs in gene regulation [23,25,26], it might have
been expected that human lncRNAs would exhibit a
weak signature of purifying selection at the population
level.

D. melanogaster intergenic lncRNA evolution

Our results show that D. melanogaster intergenic
lncRNAs are subject to moderately strong selective
constraints. SNPs occurring within fruitfly lncRNAs are
characterised by an excess of rare variants relative to
neutral sequences (either small introns or randomly
sampled sites within flanking intergenic sequences),
leading to a negative estimate of Tajima's D, and an
L-shaped site frequency spectrum. We reached the same
conclusion when considering the minor allele frequency
or the derived allele frequency or when taking account
of mutational biases (AT→GC, GC→AT). Although this
effect could be explained by a recent population
expansion [56], we reached identical conclusions when using
an algorithm that estimates population parameters
before testing for the distribution of fitness effects of
newly arising mutations [54,57].

Our findings of fruitfly lncRNA constraint at the
population level are confirmed at the interspecific level
by comparing nucleotide conservation between lncRNA
exons and introns, an extension to our previous findings
[13]. LncRNA exons were shown to exhibit an
intermediate level of conservation between protein-coding
exons and intergenic sequences, while conservation of
lncRNA introns does not differ significantly from that of
intergenic sequence.

These differences in conservation between Drosophila
lncRNA exons and introns, as well as the observation of
a greater proportion of low frequency variants within
lncRNA exons relative to lncRNA introns, argue
strongly for spliced transcripts being important for the
function of many fruitfly lncRNAs and not RNA
sequence-independent biological function as found for
some lncRNA loci such as HSI and Airn [42,43].

In contrast to results for human lncRNAs (which
confirm our previous observations [28,31]) we found no
significantly increased conservation for splice sites in
Drosophila lncRNAs relative to randomly selected 'GT'
and 'AG' dinucleotides within intergenic and intronic
sequences. This lack of increased splice site conservation,
despite an increased nucleotide conservation of the
lncRNA exons, may indicate a rapid divergence of splicing
elements within these long non-coding RNAs. This
observation could, however, also result from the mis-annotation
of splice sites as a consequence of typically low sequence
coverage for lncRNA models in RNA-Seq experiments.

Human intergenic lncRNA evolution

In contrast to evidence in flies, we found no evidence
from human population data for widespread purifying
selection acting on lncRNA sequence, and only a weak
signal of elevated sequence conservation between
vertebrate species. Few human lncRNAs were as highly
conserved as those from Drosophila (Additional File 9).

As evidence for lncRNA sequence conservation across
species is scarce, potentially orthologous transcripts
transcribed with the same orientation and syntenic
position relative to an orthologous protein coding locus
have been identified among human, mouse and
zebrafish [32]. If such positionally equivalent lncRNAs are
orthologous and retain ancestral function then purifying
selection acting on these loci might be expected to be
stronger than for the remaining lncRNAs. However,
these positionally equivalent lncRNAs' sequence
conservation across vertebrates, as well as their site frequency
spectra, were found not to differ from those of a control
set of human lncRNAs. Once again this highlights the
weak selective constraints that have acted both recently
and more historically on vertebrate lncRNAs.
Accordingly, zebrafish lncRNAs with positional equivalents in
human or mouse were found not to exhibit sequence
conservation between these species [32].

The lack of evidence for strong or widespread purify- 
ing selection or the weak selective effect of mutations 
within non-coding sequences in human has been 
reported previously, although not specifically for
transcribed non-coding sequence. Torgerson et al. [58]
compared polymorphisms in human within conserved
intergenic sequences (>5 kb upstream and downstream
of annotated transcripts) with synonymous site
polymorphisms and found no evidence for selection on
intergenic conserved sequences. Likewise, Kryukov et al.
[59] and Chen et al. [60] found that despite purifying
selection acting on the most conserved non-coding
elements in human, mutations within them have only
weak effects on fitness.

Why might fly intergenic IncRNA evolution differ from 
human intergenic IncRNA evolution? 

We estimated that an average of 35.82% of new muta- 
tions within D. melanogaster intergenic IncRNAs are 
effectively negatively selected. However, selection on all 
mutations within human intergenic IncRNAs, even 
those with a positional equivalent in mouse, was pre- 
dicted to be effectively neutral. 

Some of the observed differences in conservation and 
selection acting on IncRNAs between D. melanogaster 
and humans could be due to different origins of the two 
datasets. Our set of human IncRNAs was derived from 
adult tissues [41] whereas the fruitfly IncRNAs were 
identified from a developmental time course gene- 
expression analysis [9,13] and could therefore be subject 
to stronger selective constraints. Previous studies 
showed increased purifying selection on protein-coding 
genes expressed early during development relative to 
genes expressed during the adult stage [61]. 

A second explanation for the observed differences 
between D. melanogaster and human IncRNAs in con- 
servation and allele frequency distribution relates to dif- 
ferences in the effective population sizes of the two 
species. The influence of effective population size on the 
probability of fixation of a deleterious mutation is well 
documented [62]. According to the nearly neutral theory
of molecular evolution, the probability of fixation of such
a mutation is a function of $4N_e\mu s$ ($\mu$: mutation rate,
$s$: selection coefficient), and thus a weakly deleterious
mutation will be effectively neutral if the product of its
selection coefficient ($s$) and the effective population size
($N_e$) is near to one [63-65]. There is a considerable
difference in the estimated effective population sizes of
D. melanogaster and H. sapiens: 1,450,000 versus 1,200-15,000,
respectively [66-68]. This results in a wide range of low
selection coefficients $s$ for which deleterious mutations
have widely varying fixation probabilities between the
two species. A deleterious mutation with a small selection
coefficient in human is likely to evolve essentially
neutrally, while a mutation with the same selection
coefficient in Drosophila will tend to be subject to stronger
purifying selection. More formally, any mutation with
$|s| > 1/N_e^{\mathrm{human}}$ will be under the scrutiny of
selection in either species, while any mutation with
$1/N_e^{\mathrm{human}} > |s| > 1/N_e^{\mathrm{Drosophila}}$
will be under a selectively near neutral regime in human
but will be under more effective negative selection in
D. melanogaster. According to the effective population size
estimates cited above, the minimum value of $s$ for selection
to act on deleterious variants ranges from approximately
$7 \times 10^{-5}$ in human to three orders of magnitude
lower, $7 \times 10^{-8}$, in D. melanogaster. This difference
in effective population size
between human and Drosophila is a likely explanation 
of the striking differences in the DAF distributions of 
variants within IncRNAs in D. melanogaster and human. 
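
The dependence of a mutation's fate on effective population size can
be illustrated numerically. The sketch below is not a calculation
reported in this study: it uses the standard diffusion approximation
(due to Kimura) for the fixation probability of a new mutation
relative to that of a neutral mutation, valid for small |s|. The
function name, the human value of 10,000 (taken from within the range
cited above) and the example selection coefficient are illustrative
assumptions.

```python
import math


def relative_fixation_probability(ne: float, s: float) -> float:
    """Fixation probability of a new mutation relative to a neutral one.

    Diffusion approximation for small |s| in a diploid population:
    ratio = S / (1 - exp(-S)), with S = 4 * Ne * s.
    """
    big_s = 4.0 * ne * s
    if big_s == 0.0:
        return 1.0
    if big_s < -700.0:
        # exp(-S) would overflow; the ratio is effectively zero.
        return 0.0
    return big_s / (-math.expm1(-big_s))


# Assumed example: the same weakly deleterious mutation in both species.
S_EXAMPLE = -1e-5
for label, ne in (("human", 10_000), ("D. melanogaster", 1_450_000)):
    ratio = relative_fixation_probability(ne, S_EXAMPLE)
    print(f"{label}: Ne*|s| = {ne * abs(S_EXAMPLE):.2f}, "
          f"relative fixation probability = {ratio:.3g}")
```

Under these assumed values the mutation fixes at close to the neutral
rate in the smaller population and essentially never in the larger
one, which is the qualitative contrast described above.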

A third explanation might be that the repertoires of 
fruitfly or human IncRNA molecular mechanisms are 
very different, leading to differences in the signatures of 
selection in their IncRNA sequences. If this is indeed 
the case then we speculate that fruitfly IncRNA mechan- 
isms will be more critical to its biology than are IncRNA 
mechanisms to human biology. 

From these results testable predictions can be made 
regarding the evolution and conservation of IncRNA 
sequences. Deleterious mutations with a particular value 
of s within IncRNA in species with large effective popula- 
tion size, such as insects [59,69], are more likely to be 
purged leading to a greater sequence conservation. In 
contrast within species with low effective population size, 
such as human, weakly to mildly deleterious mutations 
are more likely to be fixed leading to a greater turn-over 
of non-coding transcribed sequences [33]. This effect 
explains the difference in the distribution of fitness effects 
of deleterious mutations at genes annotated with disease/ 
lethal phenotypes in human and fruitflies. 

Comparison with Ward and Kellis [36] 

Our conclusion that negative selection is highly ineffi- 
cient within human IncRNA variants appears to be at 
odds with evidence from Ward and Kellis that their var- 
iants exhibit a lower mean DAF than genomic samples 
[36]. This apparent discrepancy could not be explained 
by the different IncRNA sets being considered. This was 
because results from our reanalysis of the Ward and
Kellis IncRNA set from ENCODE were equivalent to 
those we report above. It could also not be explained by 
Ward and Kellis' [36] consideration only of SNPs of 
Yoruba origin, since when we re-ran our approach using 
only Yoruba SNPs, no substantive differences were 
found (Additional Files 10 and 11). Instead, we believe 
the discrepancy likely arises from the differences in the 
choice of proxy for neutral sequence. In our analysis, we 
account for the otherwise potentially confounding fac- 
tors of background selection and mutational variation 
by considering sites either within ancestral repeats 
that flank IncRNA loci, or within flanking intergenic 
sequence that has been masked for conserved sequence. 
By contrast, the approach of Ward and Kellis [36] sam- 
ples sites from concatenated unannotated intergenic 
sequences drawn from across all autosomes, and thus 
does not account for background selection or muta- 
tional rate variation. 

Although interspecies sequence conservation over long 
evolutionary time is rightly considered as an indicator of 
functionality, the lack of conservation within IncRNAs 
does not necessarily imply their lack of functionality 
[70]. Sequences encoding heart enhancers have been 
found to be as poorly conserved as randomly sampled 
sequence [71]. The accumulation of weakly to mildly 
deleterious mutations within poorly conserved sequence, 
such as human IncRNA loci, raises the question of how 
a population can carry an ever increasing burden of 
deleterious variants within loci that regulate gene 
expression? Previous hypotheses proposed that such 
sequences interact with only a limited number of factors 
or that only a very restricted proportion of sequence is 
required to convey biological function [70]. Others sug- 
gest that compensatory mutations within the locus 
maintain secondary structure [72] or similarly within the 
sequence of its interacting partner maintain molecular 
function. Such compensatory mechanisms [34] and net- 
work redundancy have been proposed to explain the 
rapid sequence evolution of IncRNAs and the absence of 
mutant phenotypes for some IncRNA knockout models. 
Finally, the accumulation of slightly deleterious muta- 
tions could also be explained by synergistic epistasis, 
when interactions between mutations produce a greater 
effect than expected from the sum of their independent 
effects. This hypothesis was first proposed to explain the 
mutational load paradox in species with low effective 
population sizes [73] but may also help to explain the 
accumulation of potentially deleterious mutations at 
synonymous sites [74] and within conserved non-coding 
sequences [59]. 

The inefficiency, or low degree, of selection acting on 
mutations within human IncRNAs suggests that for the 
great majority of these loci extensive phenotyping will
be necessary to identify the potential deleterious effects
of their disruption. Accordingly, several recent studies
have reported that despite phenotypes being observed in
cell-based assays for several IncRNA loci (HOTAIR,
Malat1, Neat1), no overt phenotype (for example, litter
size, body weight or viability) was found in the knockout
mice under normal laboratory conditions ([34,75-78]).

However, an absence of overt phenotype in laboratory
conditions does not necessarily imply that there is no
deleterious effect of the knockout. Although the knockout
mice did not overtly differ from the wild-type individuals,
further analyses found evidence for phenotypes for Evf2
[79] and BC1 [80,81] mutants. Analyses in yeast and in
worm have revealed that despite the observation of a 
lack of phenotype for a vast majority of the knockout 
mutants, fitness effects measured as population growth 
under a wide range of conditions are apparent for up to 
97% of Saccharomyces cerevisiae genes [82] and between 
42% and 60% of genes assayed in Caenorhabditis elegans.
Finally, because IncRNAs are most often expressed
at low levels and in a developmental stage- and/or
tissue-specific manner, potential phenotypes associated
with their disruption are more difficult to identify.

Conclusions 

Genetic drift appears to be the main driving force in the 
evolution of intergenic IncRNAs, at least in humans, as 
a consequence of our small effective population size. 
Therefore, weakly to mildly deleterious mutations are 
likely to have accumulated rapidly within intergenic 
IncRNAs. The consequences of such an accumulation 
on IncRNA function and on human biology have yet to 
be experimentally assessed. Our observations serve to 
highlight the pressing need for extending the study of 
these loci to in-vivo systems combined with extensive 
phenotyping. Our results support a less prominent bio- 
logical role for many of these non-coding loci than has 
been proposed previously [83,84]. 

Materials and Methods 

In all analyses that we describe below, calculated P values 
were corrected for multiple testing using a Bonferroni 
correction [85]. 
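
As a minimal illustration of this correction (a sketch, not the code
used for the analyses), each raw P value is multiplied by the number
of tests performed and capped at 1. The function name is an assumption
made for the example.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a family of P values.

    Each P value is multiplied by the number of tests in the family
    and capped at 1.0.
    """
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


raw = [0.01, 0.04, 0.5]
print(bonferroni(raw))
```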

Our analysis in D. melanogaster was conducted on the
set of 1,115 long non-coding intergenic RNAs defined
by Young et al. [13] using polyA+-selected transcriptome
data from the modENCODE Project [9], having
excluded four loci owing to their overlap with recently
predicted small open reading frames [86]. For comparison
we also analysed a set of 4,662 human IncRNAs
identified by Cabili et al. [41] from polyA+-selected
libraries using conservative criteria, namely one isoform
reconstructed in at least two tissues or by two
assemblers [41].
Because mono-exonic IncRNA models are not
stranded, we limited our analysis to multi-exonic loci. 
Furthermore, in order to avoid the confounding effects 
arising from selection acting on protein-coding genes 
we focused our analysis on intergenic IncRNA loci, 
instead of intronic, antisense or IncRNAs that overlap 
untranslated regions of protein-coding genes. 

We used the mouse IncRNAs annotated by Ensembl
and by Belgard et al. [87] to identify positional
equivalent IncRNAs between mouse and human. Using
protein-coding genes with 1-to-1 orthologous relationships
between human and mouse and flanking a IncRNA
locus in both species, we defined as positional
equivalents those IncRNAs that were found in the same
transcriptional orientation and the same location relative to
a protein-coding gene in both species. Furthermore, in
order to take into account potential selection acting on
the nearby protein-coding gene, we also identified a
control set composed of IncRNAs flanking protein-coding
genes with 1-to-1 orthologs but with different
transcriptional orientations and/or positions relative to the
protein-coding gene. We identified 374 positional
equivalent loci between human and mouse, and 802
control IncRNAs.
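
The classification into positional equivalents and controls can be
sketched as follows. This is a simplified illustration under stated
assumptions, not the pipeline used here: the `Flank` record, its
field names and the `classify` function are hypothetical, and each
IncRNA is assumed to have already been paired with a flanking
protein-coding gene that has a 1-to-1 ortholog.

```python
from typing import NamedTuple


class Flank(NamedTuple):
    """Hypothetical record linking a lncRNA to a flanking coding gene."""
    lncrna: str
    anchor_gene: str   # flanking protein-coding gene
    same_strand: bool  # lncRNA transcribed on the anchor's strand
    upstream: bool     # lncRNA lies 5' of the anchor


def classify(human, mouse, orthologs):
    """Return (positional_equivalents, controls) as sets of human ids.

    orthologs maps a human gene id to its 1-to-1 mouse ortholog.
    """
    mouse_by_anchor = {}
    for record in mouse:
        mouse_by_anchor.setdefault(record.anchor_gene, []).append(record)

    equivalents, controls = set(), set()
    for h in human:
        candidates = mouse_by_anchor.get(orthologs.get(h.anchor_gene), [])
        if not candidates:
            continue  # no mouse lncRNA flanks the orthologous gene
        if any(c.same_strand == h.same_strand and c.upstream == h.upstream
               for c in candidates):
            equivalents.add(h.lncrna)
        else:
            controls.add(h.lncrna)
    # A locus matched through any anchor counts as a positional equivalent.
    controls -= equivalents
    return equivalents, controls
```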

We collected 2,993 genes described as being involved
in syndromes and genetic diseases from the OMIM database
[88,89]. Using the FlyBase database [90], we collated
2,125 genes with lethal mutant phenotypes.

D. melanogaster and human gene annotations and
genomes were downloaded from FlyBase [90] (release
5.39) and Ensembl [91] (release 64), respectively.

Polymorphism data for 162 D. melanogaster strains
from Raleigh, North Carolina were downloaded from 
the Drosophila Genetic Reference Panel [39,92,93]. Sites 
covered by at least 10 reads and without base ambiguity 
in at least 150 strains were retained for further analysis. 
A total of 3,172,754 sites across the five major chromo- 
somal elements were used for analysis. For the human 
dataset, we discarded SNPs within 10 bp of indel calls 
and chose a quality score threshold to give a 0.1% FDR. 
The allele frequencies for polymorphic sites were 
retrieved from the 1000 Genomes Project data. We col- 
lected 18,745,840 SNPs in 174 individuals of African ori- 
gin (a highly polymorphic population) called by the 1000 
Genomes Project Consortium [40,94]. 

For both datasets, we polarised the alleles into ancestral
or derived states using the pairwise alignments of
D. melanogaster with D. simulans and D. yakuba, and of
H. sapiens with the chimpanzee (Pan troglodytes) and
macaque (Macaca mulatta), which are available from
the UCSC genome database website [95]. We used maximum
parsimony to infer the ancestral state of each site,
and ambiguous sites were removed from the final dataset.
Using genome annotations, we collated sites found
within exons and introns of protein-coding genes,
IncRNA loci or intergenic sequences or ancestral repeats
(transposable elements shared between human, mouse
and rat) (Table 1).
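
One conservative reading of this polarisation step is sketched below.
It is an assumption about how ambiguity might be handled rather than a
description of the exact rule applied: a site is polarised only when
both outgroup bases agree and match one of the segregating alleles;
all other sites are treated as ambiguous. Function names are
illustrative.

```python
VALID_BASES = frozenset("ACGT")


def ancestral_allele(alleles, outgroup_bases):
    """Return the inferred ancestral base, or None if ambiguous.

    The site is polarised only if all outgroups carry the same
    unambiguous base and that base is one of the observed alleles.
    """
    observed = {base.upper() for base in outgroup_bases}
    if len(observed) != 1:
        return None
    base = next(iter(observed))
    if base not in VALID_BASES:
        return None
    if base not in {allele.upper() for allele in alleles}:
        return None
    return base


def derived_allele_frequency(allele_counts, outgroup_bases):
    """Derived allele frequency at one SNP, or None if ambiguous.

    allele_counts maps each allele to its count in the sample.
    """
    counts = {allele.upper(): n for allele, n in allele_counts.items()}
    ancestral = ancestral_allele(counts, outgroup_bases)
    if ancestral is None:
        return None
    total = sum(counts.values())
    return (total - counts[ancestral]) / total
```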

Evolutionary rates and sequence conservation 

PhastCons scores [2] computed using the alignments of
11 Drosophila species, Anopheles gambiae, Tribolium
castaneum and Apis mellifera (whose divergence spans
approximately 300 Mya) were downloaded from the
UCSC database [95]. We computed the median phastCons
scores for each of 10 successive windows that
each represents a 10% portion of IncRNA exon or intron 
sequence; exons or introns were further subdivided into 
'first', 'middle', 'last' or 'unique' classes with respect to 
their genomic position. We also collected 1,000 nt of 5' 
and 3' flanking intergenic sequences for both IncRNAs 
and protein coding loci. 

We computed, for each window, 95% confidence inter- 
vals using 10,000 bootstraps. As a control, we randomly 
selected intergenic sequences lying away (>1 kb) from 
any annotated gene whose size distribution matched that 
of the IncRNA exons or introns. One thousand such sets 
of control sequences were defined to permit confidence 
intervals to be calculated. For comparison this analysis 
was also performed on the set of protein-coding genes 
that flank IncRNA loci. 
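
The windowing and bootstrap steps can be sketched as follows. This is
an illustrative outline under stated assumptions, not the code used
here: per-base scores for one feature are assumed to be held in a
list, the confidence interval is a simple percentile bootstrap of the
median, and both function names are hypothetical.

```python
import random
import statistics


def window_scores(scores, n_windows=10):
    """Split one feature's per-base scores into n_windows equal parts.

    Features shorter than n_windows bases yield some empty windows.
    """
    n = len(scores)
    return [scores[i * n // n_windows:(i + 1) * n // n_windows]
            for i in range(n_windows)]


def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Median of values with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(values), lower, upper
```

In use, the scores falling in a given window would be pooled across
loci before `bootstrap_median_ci` is called once per window.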

This procedure was repeated for human IncRNA loci
and their neighbouring protein-coding genes using
phastCons scores computed using the alignments of 46
vertebrate genomes from the UCSC database [95]
(approximately 400 My).

In order to assess the difference in nucleotide conserva- 
tion between IncRNA exons and introns, we implemen- 
ted a resampling analysis in which we randomly sampled 
a single site per feature (exon or intron) within a locus. 
In total, 1,000 resampling analyses were performed. 

We estimated the conservation of the splice sites of
both protein-coding and IncRNA loci in flies using the
sequence alignments of 50 nucleotides upstream and
downstream of the D. melanogaster splice sites with
D. simulans, D. sechellia, D. yakuba and D. erecta. For 5'
and 3' splice sites and the 20 adjacent intronic sites of
protein coding genes and IncRNA loci we computed the
information content using the Shannon-Weaver index.

As a control, we randomly selected 'GT' and 'AG'
dinucleotides within intergenic sequences flanking the
IncRNA loci and applied the same procedure.
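
A per-site information content of this kind can be sketched as below.
The convention used, 2 bits minus the Shannon entropy of the base
frequencies in an alignment column, is a common choice and an
assumption here; no small-sample correction is applied, gaps and
ambiguous bases are ignored, and the function names are illustrative.

```python
import math
from collections import Counter

VALID_BASES = frozenset("ACGT")


def column_information(column):
    """Information content (bits) of one alignment column.

    Returns 2 - H, where H is the Shannon entropy of the observed
    base frequencies, or None if the column has no unambiguous base.
    """
    bases = [b.upper() for b in column if b.upper() in VALID_BASES]
    if not bases:
        return None
    n = len(bases)
    entropy = -sum((c / n) * math.log2(c / n)
                   for c in Counter(bases).values())
    return 2.0 - entropy


def alignment_information(rows):
    """Information content of every column of equal-length sequences."""
    return [column_information(col) for col in zip(*rows)]


# Toy alignment of a 5' splice site across five species (made-up data).
toy = ["GTAAG", "GTAAG", "GTGAG", "GTAAG", "GTCAG"]
print(alignment_information(toy))
```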

Polymorphism estimators 

We used VariScan [96] to compute polymorphism indicators
($\pi$, $\theta_W$, Tajima's $D$). Genomic alignments with
D. simulans and rhesus macaque for D. melanogaster
and human, respectively, were used to compute the
Jukes-Cantor corrected per site divergence ($k$). To avoid
any potential bias arising from local variations in recombi- 
nation rate, mutation rate, efficacy of selection or nucleo- 
tide composition, we limited our analysis to only those 
protein coding genes, small introns or ancestral repeats 
that are found in the neighbouring genomic regions of 
IncRNA loci (within 5 kb). Likewise in human we analysed 
IncRNA loci flanked by proximal (<10 kb) ancestral 
repeats and their flanking protein-coding genes. Similar 
conclusions were reached from analyses with distance 
thresholds of 5 kb and 20 kb (Additional Files 5 and 6). 
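
For orientation, the textbook definitions of these estimators are
sketched below. This is not VariScan's implementation, whose handling
of missing data and sliding windows is not reproduced; the function
names and argument layout are assumptions made for the example.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected divergence from the raw proportion p
    of differing sites: k = -3/4 * ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def watterson_theta(segregating_sites, n_chromosomes, n_sites):
    """Watterson's estimator per site: S / (a_n * L),
    with a_n = sum of 1/i for i = 1 .. n-1."""
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return segregating_sites / (a_n * n_sites)


def nucleotide_diversity(allele_counts, n_chromosomes, n_sites):
    """Per-site nucleotide diversity for biallelic SNPs.

    allele_counts holds, for each SNP, the count k of one allele among
    n chromosomes; each SNP contributes 2k(n-k) / (n(n-1)).
    """
    n = n_chromosomes
    total = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    return total / n_sites
```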

Similarly we compared the derived allele frequency of 
polymorphic sites within IncRNA exons or IncRNA 
introns to sites within small introns, non-degenerate 
sites and four-fold degenerate sites. 

Because the putatively neutral sites we used are not 
interdigitated with our sites of interest (such as IncRNA 
exonic nucleotides), there remains the possibility that 
our indicators of purifying selection are artificially 
inflated [97]. In order to take such biases into account, 
when considering N sites from each IncRNA locus associated
with an intergenic flanking sequence (> 1,000 nt
following the masking of conserved non-coding elements
with nucleotide identity >90% over >20 nt), we
randomly sampled the same number (N) of sites from this
masked flanking sequence to be used as a neutral proxy.
For the study of non-degenerate sites, we used four-fold 
degenerate sites within the same protein as a neutral 
proxy in human. However, because there is evidence for 
selection having acted on four-fold degenerate sites in 
Drosophila, we instead used small introns (<86 nt) as 
our neutral proxy and limited our analysis to just those 
protein-coding genes which contain such small introns. 
This analysis permits the strength of selection acting on 
IncRNAs to be estimated while controlling for variations 
in the local mutation rate, as well as background selec- 
tion associated with nearby functional elements includ- 
ing protein-coding genes and well conserved non- 
transcribed non-coding regulatory elements. We used 
this methodology to assess the degree of selective con- 
straints acting on intergenic IncRNAs through a general- 
ised McDonald-Kreitman test [98-100]. We compared 
the numbers of polymorphic over divergent sites within 
IncRNA exons and IncRNA introns to the numbers 
observed within sampled putatively neutral sites using a 
$\chi^2$ test with one degree of freedom.
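
The comparison of polymorphic and divergent site counts reduces to a
test on a 2 x 2 table. A minimal sketch follows, assuming a Pearson
statistic without continuity correction (whether a correction was
applied is not stated above); the function name and the example counts
are made up for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 d.f.) on the table [[a, b], [c, d]].

    For a McDonald-Kreitman style comparison: a, b are polymorphic and
    divergent sites in the class of interest; c, d are the same counts
    at the neutral proxy. Returns (statistic, p_value).
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: lncRNA exons versus sampled neutral sites.
print(chi_square_2x2(20, 30, 30, 20))
```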

For either D. melanogaster or human IncRNAs, we
used the site frequency spectra of mutations occurring
within the sampled putatively neutral sites to estimate
the distribution of fitness effects of new deleterious
mutations within IncRNAs (in terms of $-N_e s$) using
DFE-alpha [54,57,103]. Confidence interval values for
the proportion of sites under the different $N_e s$ categories
were estimated through 200 bootstraps per locus. This
analysis should therefore also take into account the
effects of background selection as for each locus a
'neutral' reference is drawn from the same region.
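
The site frequency spectrum on which this estimation rests can be
tabulated as sketched below. The sketch only builds the unfolded
spectrum from derived allele counts; it does not reproduce DFE-alpha
or its input format, and the function name is an assumption.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded site frequency spectrum.

    derived_counts holds the derived allele count at each site.
    Entry i of the result is the number of sites at which the derived
    allele is carried by i + 1 of the n chromosomes; fixed and absent
    alleles (counts of n or 0) are excluded.
    """
    tally = Counter(k for k in derived_counts if 0 < k < n_chromosomes)
    return [tally.get(k, 0) for k in range(1, n_chromosomes)]


print(site_frequency_spectrum([1, 1, 2, 5, 0, 6], 6))
```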

Statistics 

Comparisons between locus classes for the polymorph- 
ism estimators were performed using Kruskal-Wallis 
tests. The minor and derived allele frequencies distribu- 
tions for each class were compared using Kolmogorov- 
Smirnov tests. 
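
For reference, the two-sample Kolmogorov-Smirnov statistic is the
largest absolute difference between two empirical cumulative
distributions. The sketch below computes only that statistic, not the
associated P value, and is an illustration rather than the software
used for the tests reported here.

```python
def ks_statistic(sample_x, sample_y):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    xs, ys = sorted(sample_x), sorted(sample_y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        value = min(xs[i], ys[j])
        # Advance past every observation equal to the current value.
        while i < n and xs[i] == value:
            i += 1
        while j < m and ys[j] == value:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d
```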

Additional material 



Additional File 1: Average phastCons scores across protein-coding 
(blue) and IncRNA (red) gene models in D. melanogaster (A) and human 
(B, C). Two hundred evenly-spaced nucleotides were randomly sampled 
per feature. The gray lines represent the 95% confidence intervals 
computed over 1,000 resamplings. Average phastCons score for IncRNAs
in human was computed over 200 randomly selected equidistant
nucleotides within each of the categories. Confidence intervals were
computed using 1,000 resamplings of the data.

Additional File 2: Median sequence conservation (phastCons) score 
across protein coding (blue) and positionally equivalent (PE) 
IncRNA (red) in human. 

Additional File 3: Comparison of protein-coding (blue) and IncRNA (red) 
5' (A) and 3' (B) splice site conservation in D. melanogaster . Only protein 
coding sequences flanking IncRNAs were used in the analysis. The 
control set is based on the random selection of 'GT' and 'AG' 
dinucleotides within the intergenic sequence flanking the IncRNAs in D. 
melanogaster. The Shannon-Weaver index was computed for each site 
using the alignments of each splice site and its neighbouring sequences 
with D. simulans, D. sechellia, D. yakuba and D. erecta with Muscle [102]. 

Additional File 4: Distribution of the distances between consecutive 
SNPs within protein coding (black) and IncRNA (red) exons in D. 
melanogaster. 

Additional File 5: Average (standard deviation) polymorphism 
estimates for IncRNA and their flanking protein coding genes in 
human. PE: positional equivalent. A maximum distance threshold 
between IncRNA loci and ancestral sequences of 5 kb was applied. 

Additional File 6: Average (standard deviation) polymorphism 
estimates for IncRNA and their flanking protein coding genes in 
human. PE: positional equivalent. A maximum distance threshold 
between IncRNA loci and ancestral sequences of 20 kb was applied. 

Additional File 7: Comparison of derived allele frequency distribution of 
SNPs at non-synonymous sites (dark blue), within 3' UTR (yellow), IncRNA 
exons (red), 5' UTR, at four-fold degenerate sites (light blue), and within 
small introns in D. melanogaster. 

Additional File 8: Derived allele frequency spectra for 0-fold, four-fold 
degenerate sites, sites within IncRNA, sites upstream (400 nt) IncRNAs 
and protein coding genes in D. melanogaster (A) and human (B). 

Additional File 9: Distribution of average conservation scores for 
intergenic IncRNAs in human. 

Additional File 10: Comparison of derived allele frequency 
distribution of SNPs at 0-fold degenerate sites (blue), GENCODE 
IncRNA exons (red), ancestral repeats (green) and four-fold 
degenerate sites (light blue) in human. 

Additional File 11: Comparison of derived allele frequency 
distribution of SNPs at 0-fold degenerate sites (blue), GENCODE 
IncRNA exons (red), ancestral repeats (green) and four-fold 
degenerate sites (light blue) in individuals of Yoruba origin. 